iT邦幫忙

2023 iThome 鐵人賽

DAY 27
0
AI & Data

生資的路且重且遠,我要被鴨垮了Q系列 第 27

Day27. GATK Germline Pipeline from Data Preparation to Variant Calling (Code)

  • 分享至 

  • xImage
  •  

我們開場先來上兩張圖,第一張圖是設計用來處理一個研究群體的pipeline,第二張圖是單個sample,"per-sample" 步驟主要針對單個樣本進行變異體呼叫,而 "cohort" 步驟則是針對多個樣本進行聯合基因分型,"per-sample" 和 "cohort" 之間的主要區別在於處理數據的層次和目的。

Main steps for Germline Cohort Data

Call variants per-sample

那我們就來針對單個sample下載,然後執行吧

#!/bin/bash

# By human WGS paired-end reads
# GATK4 best practices workflow - https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
# Sample list from : https://genome.cshlp.org/content/suppl/2013/09/24/gr.156034.113.DC1/Illumina_reads.txt

# Download data
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00150/sequence_read/ERR015533_1.filt.fastq.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00150/sequence_read/ERR015533_2.filt.fastq.gz

echo "Running ~~"

###################################################
#Prepare some reference files (ONLY Need ONCE) 
###################################################

# Download reference files
mkdir -p ~/Desktop/demo/reference_files/hg38/
wget -P ~/Desktop/demo/reference_files/hg38/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip ~/Desktop/demo/reference_files/hg38/hg38.fa.gz

# Index reference - create .fai file before running HaplotypeCaller
samtools faidx ~/Desktop/demo/reference_files/hg38/hg38.fa

# Reference dictionary - create .dict file before running HaplotypeCaller
gatk CreateSequenceDictionary R=~/Desktop/demo/reference_files/hg38/hg38.fa O=~/Desktop/demo/reference_files/hg38/hg38.dict

# Download known sites files for BQSR from GATK resource bundle
mkdir -p ~/Desktop/demo/reference_files/hg38/
wget -P ~/Desktop/demo/reference_files/hg38/ https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
wget -P ~/Desktop/demo/reference_files/hg38/ https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx

######################################################
#Variant calling -- Start
######################################################
# Define directories
ref="/Users/Desktop/demo/reference_files/hg38/hg38.fa"
known_sites="/Users/Desktop/demo/reference_files/hg38/Homo_sapiens_assembly38.dbsnp138.vcf"
aligned_reads="/Users/Desktop/demo/VC/aligned_reads"
reads="/Users/Desktop/demo/VC/reads"
results="/Users/Desktop/demo/VC/results"
data="/Users/Desktop/demo/VC/data"

# -------------------
# STEP 1: QC - Run FastQC
# -------------------

echo "STEP 1: QC Check - Run FastQC"
fastqc ${reads}/ERR015533_1.filt.fastq.gz -o ${reads}/
fastqc ${reads}/ERR015533_2.filt.fastq.gz -o ${reads}/

# If the results from FastQC indicate that the quality does not look okay, you should consider trimming or potentially discarding this sample. In this case, no trimming is required, as the quality looks okay.

# --------------------------------------
# STEP 2: Map to Reference using BWA-MEM
# --------------------------------------

echo "STEP 2: Map to Reference genome by using BWA-MEM"

# Index reference with BWA
bwa index ${ref}

# BWA alignment
bwa mem -t 4 -R "@RG\tID:ERR015533\tPL:ILLUMINA\tSM:ERR015533" ${ref} ${reads}/ERR015533_1.filt.fastq.gz ${reads}/ERR015533_2.filt.fastq.gz > ${aligned_reads}/ERR015533.paired.sam

# -----------------------------------------
# STEP 3: Mark Duplicates and Sort - GATK4
# -----------------------------------------

echo "STEP 3: Mark Duplicates and Sort - GATK4"
gatk MarkDuplicatesSpark \
    -I ${aligned_reads}/ERR015533.paired.sam \
    -O ${aligned_reads}/ERR015533_sorted_dedup_reads.bam


# ----------------------------------
# STEP 4: Base Quality Recalibration
# ----------------------------------

echo "STEP 4: Base Quality Recalibration"

# 1. Build the model
gatk BaseRecalibrator \
    -I ${aligned_reads}/ERR015533_sorted_dedup_reads.bam \
    -R ${ref} \
    --known-sites ${known_sites} \
    -O ${data}/recal_data.table

# 2. Apply the model to adjust the base quality scores
gatk ApplyBQSR \
    -I ${aligned_reads}/ERR015533_sorted_dedup_reads.bam \
    -R ${ref} --bqsr-recal-file ${data}/recal_data.table \
    -O ${aligned_reads}/ERR015533_sorted_dedup_bqsr_reads.bam

# -----------------------------------------------
# STEP 5: Collect Alignment & Insert Size Metrics
# -----------------------------------------------

echo "STEP 5: Collect Alignment & Insert Size Metrics"

gatk CollectAlignmentSummaryMetrics R=${ref} I=${aligned_reads}/ERR015533_sorted_dedup_bqsr_reads.bam O=${aligned_reads}/alignment_metrics.txt

gatk CollectInsertSizeMetrics INPUT=${aligned_reads}/ERR015533_sorted_dedup_bqsr_reads.bam OUTPUT=${aligned_reads}/insert_size_metrics.txt HISTOGRAM_FILE=${aligned_reads}/insert_size_histogram.pdf

# ----------------------------------------------
# STEP 6: Call Variants - GATK HaplotypeCaller
# ----------------------------------------------

echo "STEP 6: Call Variants - GATK HaplotypeCaller"

gatk HaplotypeCaller \
    -R ${ref} \
    -I ${aligned_reads}/ERR015533_sorted_dedup_bqsr_reads.bam \
    -O ${results}/raw_variants.vcf

# Extract SNPs & INDELs
gatk SelectVariants \
    -R ${ref} \
    -V ${results}/raw_variants.vcf --select-type SNP \
    -O ${results}/raw_snps.vcf

gatk SelectVariants \
    -R ${ref} \
    -V ${results}/raw_variants.vcf --select-type INDEL \
    -O ${results}/raw_indels.vcf


Reference:

Germline short variant discovery (SNPs + Indels)


上一篇
Day26. 去除Adapter和低質量序列(Trim)
下一篇
Day28. 詳細解釋Variant calling 步驟
系列文
生資的路且重且遠,我要被鴨垮了Q30
圖片
  直播研討會
圖片
{{ item.channelVendor }} {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言